The goal of this Tutorial is to learn some techniques to display and analyze protein-protein interactions (PPI’s) using R. We will use a Science article for it from which we will take the data.
Kim et al. 2021, Science 374, 50
This research makes use of protein-protein interactions as observed using affinity purification and mass spectrometry (AP-MS). The article focuses on new interactions.
We want to visualize PPI’s and perform some basic analysis on
it.
This plots are not going to be 100% the same as in the article, that
would involve too many expert steps. However by using R functions we can
create some PPI Maps of the AP-MS data.
Furthermore, we learn some techniques to further go into some depth on
the proteins that interact. Like what the function(s) they might have in
a cell, or in which biological process they act.
In this COO we will not generate exactly the same plots as in the article, but look at the data itself and the information it contains.
the article (pdf) is available in the folder of this tutorial
We’re interested in working with protein interaction data in R. Before we can do that we need to gain some knowledge.
A graph is the mathematical term for a structure that links things together into a kind of network - A graph in this sense is not a ‘plot’, although the word is used for that often - You can plot a graph, it looks like this:
## Nodes - Nodes
(or vertices) are the ‘things’ that are connected, for instance proteins
- Other examples are people or pieces of information, think of the
pictures in a detective series, which are connected with strings of
wire:
You can imagine describing a graph in R, is not as easy as storing a
list of numbers. We will need something more ‘smart’.
The simplest way to write down a graph, is just to write out all of the
connections. For example:
‘Protein A’ interacts with ‘Protein B’ ‘Protein A’ interacts with
‘Protein C’ ‘Protein D’ interacts with ‘Protein E’
etc.
If you think about it, the nodes of the above graph would be Proteins A, B, C, D and E, so all the unique names in the list. The edges are fully described by the listed interactions.
This, in practice, is how to present graphs to R. There is a specific library that takes this kind of information and turns it into a complex datastructure called graphNEL (https://www.bioconductor.org/packages/devel/bioc/vignettes/graph/inst/doc/graph.pdf). Having the graph in this structure makes it very easy to show the graph, and to investigate it.
Because we’re interested in the PPI described in the article, we’re giving you a ready-made function that we’re going to use in today’s Tutorial. We use terms related to the article (i.e. baits and preys and proteingraph).
Basically the graph data structure are two vectors: one with baits and one with preys similar the the interaction list above. The function turns these into a graph with the correct nodes and edges.
# Create a protein-protein interaction graph based on 2 or 3 vectors:
# - baits
# - preys
# - weights, optional, gives the strength of the edge
makeproteingraph <- function(baits, preys, weights=1) {
allproteins <- c(baits, preys)
uniqueproteins <- unique(allproteins)
# First we create a collection of all nodes, they are not connected yet
proteingraph <- new("graphNEL", nodes = uniqueproteins)
# Add edges to the graph:
# See http://rss.acs.unt.edu/Rdoc/library/graph/doc/graph.pdf for more examples
proteingraph <- addEdge(baits, preys, proteingraph, weights)
return(proteingraph)
}
To see if our function works we can create a simple graph ourselves. For that we’ll create a bait and a prey vector. We will experiment with the baits and preys to get an idea about how these graphs work.
The following is an overview of the methods we are going to use.
Remember you can get help about functions from packages, using ‘?’.
First a simple example. Functions are given to show how to create the graph structure (our function) and how to lay-out and plot the graph.
In this example ‘A’ is always the bait and it binds to ‘B’, ‘C’, ‘D’ and ‘E’. You have to think of these as pairs, so the first element of the baits is linked to the first element in the preys.
baits <- c("A", "A", "A", "A")
preys <- c("B", "C", "D", "E")
#First we use our function to create the graph
simple_example_graph <- makeproteingraph(baits, preys)
# Now we use 'layoutGraph' and 'renderGraph' from Rgraphviz to make a graphical output
example_graph_plot <- layoutGraph(simple_example_graph, layoutType="neato")
renderGraph(example_graph_plot)
Do you understand why the plot looks the way it looks?
baits <- c("A", "A", "A", "A")
preys <- c("B", "C", "D", "E")
#First we use our function to create the graph
simple_example_graph <- makeproteingraph(baits, preys)
# Now we use 'layoutGraph' and 'renderGraph' from Rgraphviz to make a graphical output
example_graph_plot <- layoutGraph(simple_example_graph, layoutType="neato")
renderGraph(example_graph_plot)
Do you understand why the plot looks the way it looks?
baits <- c("A", "A", "A", "A")
preys <- c("B", "C", "D", "E")
#First we use our function to create the graph
simple_example_graph <- makeproteingraph(baits, preys)
# Now we use 'layoutGraph' and 'renderGraph' from Rgraphviz to make a graphical output
example_graph_plot <- layoutGraph(simple_example_graph, layoutType="neato")
renderGraph(example_graph_plot)
With the libraries in memory we will start by loading in the supplementary data of the article. All the ‘raw’ data is in supplementary table S3. The other tables contain the same data, however in other stages of post-processing and hence more refined.
For the purpose of this tutorial we will use the data from supplementary data S6 (science.abf3066_Table_S6.xlsx). This excel file contains two (2) sheets, README and “Differential Interaction Score”, we need the latter.
It is good practice to investigate the content of a variable, to check if you have to correct data and/or see if everything worked as intended. For this you can use the print() or head() function. However also just typing the variable name will often also work.
## lets load our supplementary data
SuppS6 <- read_xlsx("~/Desktop/untitled folder/Week5/week5/science.abf3066_Table_S6.xlsx", sheet = "Differential Interaction Score")
head(SuppS6)
## # A tibble: 6 × 8
## Bait BaitUniprot Prey PreyUniprot DIS BFDR Cell Info
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 ARID1A O14497 SMARCE1 Q969G3 -0.0008 0.439 mcf10a mda231(0.0000:0.…
## 2 SMARCD1 Q96GM5 SMARCE1 Q969G3 -0.228 0.410 mcf10a mda231(0.0077:0.…
## 3 MSH2 P43246 GGCT O75223 -0.33 0.399 mcf10a mda231(0.0000:0.…
## 4 AKT3 Q9Y243 NSF P46459 -0.33 0.399 mcf10a mda231(0.0000:0.…
## 5 STK11 Q15831 PKP4 Q99569 -0.35 0.393 mcf10a mda231(0.0000:0.…
## 6 SMARCB1 Q12824 SS18 Q15532 -0.385 0.388 mcf10a mda231(0.1350:0.…
head(SuppS6)
## # A tibble: 6 × 8
## Bait BaitUniprot Prey PreyUniprot DIS BFDR Cell Info
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 ARID1A O14497 SMARCE1 Q969G3 -0.0008 0.439 mcf10a mda231(0.0000:0.…
## 2 SMARCD1 Q96GM5 SMARCE1 Q969G3 -0.228 0.410 mcf10a mda231(0.0077:0.…
## 3 MSH2 P43246 GGCT O75223 -0.33 0.399 mcf10a mda231(0.0000:0.…
## 4 AKT3 Q9Y243 NSF P46459 -0.33 0.399 mcf10a mda231(0.0000:0.…
## 5 STK11 Q15831 PKP4 Q99569 -0.35 0.393 mcf10a mda231(0.0000:0.…
## 6 SMARCB1 Q12824 SS18 Q15532 -0.385 0.388 mcf10a mda231(0.1350:0.…
Now we know what our function does, let’s have a look at our experimental data.
SuppS6 contains a Bait and a Prey column. It also contains two
additional columns, with an other identifier (uniprot) for these
proteins.
We will use Bait and Prey, but it will work equally well with the other
two columns.
For convenience we will first store the content of these columns in their own variables: baits and preys as we will use them in later steps again, but this step isn’t needed as we can also directly use these ‘columns’ as input in our makeproteingraph(.. , ..) function.
baits <- SuppS6$Bait
preys <- SuppS6$Prey
interactionGraph <- makeproteingraph(baits, preys)
interactionGraph
## A graphNEL graph with undirected edges
## Number of Nodes = 523
## Number of Edges = 583
We will make use of the neato layout layoutType=“neato” (there are more layout types, but ‘calculating’ these is slow).
library(Rgraphviz)
plot(interactionGraph, attrs = list(graph = list(layoutType = "neato")))
# Load necessary library
library(Rgraphviz)
# Get node names
nodeNames <- nodes(interactionGraph)
# Assign colors to nodes (you can modify this to different colors)
nodeAttrs <- list(fillcolor = setNames(rep("lightblue", length(nodeNames)), nodeNames))
# Get edge names
edgeNames <- edgeNames(interactionGraph)
# Assign colors to edges (if needed)
edgeAttrs <- list(color = setNames(rep("pink", length(edgeNames)), edgeNames))
# Plot the interaction graph with colors
plot(interactionGraph, attrs = list(graph = list(layoutType = "neato")), nodeAttrs = nodeAttrs, edgeAttrs = edgeAttrs)
## use the example code above to layout the graph, and run the renderGraph(..) function (1.)
graphLayout <- layoutGraph(interactionGraph, layoutType = "neato")
renderGraph(graphLayout)
# Load the necessary library
library(Rgraphviz)
# Get node names
nodeNames <- nodes(interactionGraph)
# Assign different colors to nodes (example: alternating colors)
nodeColors <- setNames(rep(c("lightblue","plum", "pink"), length.out = length(nodeNames)), nodeNames)
# Define node attributes (colors)
nodeAttrs <- list(fillcolor = nodeColors)
# Get edge names
edgeNames <- edgeNames(interactionGraph)
# Assign colors to edges (randomized example)
edgeColors <- setNames(sample(c("beige", "lightcoral", "purple", "orange"), length(edgeNames), replace = TRUE), edgeNames)
# Define edge attributes (colors)
edgeAttrs <- list(color = edgeColors)
# Generate a layout with attributes
graphLayout <- layoutGraph(interactionGraph, layoutType = "neato", nodeAttrs = nodeAttrs, edgeAttrs = edgeAttrs)
# Render the graph with colors
renderGraph(graphLayout)
Take a look at the graph below.. It doesn’t look the same but trust me it is the same PPI information as in figure 2A. However they have made it much better looking.
Figure 2H in the article uses the same data, however now only shows the interactions that are commonly found in the cancer lines and not in the mcf10A line. In other words, they only look at those PPI’s that are specific for cancer cells and not observed in the ‘normal’ mcf10a line.
The supplementary data has a DIS column. The autors describe that a positive DIS means that it is specific to cancer lines (README part in TABLE S6)
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
hint: We will need to subset Observations (rows).
dim(SuppS6)
## [1] 589 8
SuppS6_sub <- SuppS6 |> filter(DIS > 0.5)
head(SuppS6_sub)
## # A tibble: 6 × 8
## Bait BaitUniprot Prey PreyUniprot DIS BFDR Cell Info
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 TP53 P04637 TLN1 Q9Y490 1 0 mcf7 mda231(0.0000:0.2987)…
## 2 XPC Q01831 IMPDH2 P12268 1 0 mcf7 mda231(0.0000:0.2987)…
## 3 CDKN1B P46527 CCNB2 O95067 1 0 mcf7 mda231(0.0000:0.2987)…
## 4 CDKN1B P46527 CCNA2 P20248 1 0 mcf7 mda231(0.0000:0.2987)…
## 5 MTDH Q86UE4 RPLP0 P05388 1 0 mcf7 mda231(0.0000:0.2987)…
## 6 PIK3CA P42336 IGHG1 P01857 1 0 mcf7 mda231(0.0000:0.2987)…
dim(SuppS6_sub)
## [1] 369 8
We have our subset, which should contain only the PPIs found in cancer lines. Of course we like to plot this as well, like we did with the complete dataset.
All the steps above are similar to the steps you did for the ‘complete’ dataset
Similar to the figure in the article we would like to highlight the baits a bit better. Also the preys are a bit crowded.
Below you find an example code that takes the baits, makes them blue, and changes the circle nodes into rectangles. It also changes the font size for the baits and makes the font for the preys very small (invisible).
However for this tutorial I will provide the code (below). Try to see if you understand the code. Run it and maybe play with the code a bit to change the visualization to your liking.
In short this is what the code below does:
The nitty gritty details are in: https://www.bioconductor.org/packages/devel/bioc/vignettes/Rgraphviz/inst/doc/newRgraphvizInterface.pdf
SuppS6_sub <- filter(SuppS6, DIS > 0.5)
ppiGraph <- makeproteingraph(SuppS6_sub$Bait, SuppS6_sub$Prey, SuppS6_sub$DIS)
ppiGraph <- layoutGraph(ppiGraph, layoutType="neato") #layoutType="fdp" is also cool, but slow, try it out!
## formatting the prey nodes (SuppS6_sub$Prey)
preyFont <- rep(1, length(SuppS6_sub$Prey)) #assign a very small font size to the prey nodes
names(preyFont)<-SuppS6_sub$Prey
## formatting the bait nodes (SuppS6_sub$Bait)
fill <- rep("lightblue", length(SuppS6_sub$Bait)) #assign the blue color to the bait nodes
names(fill) <- SuppS6_sub$Bait
shape <- rep("box", length(SuppS6_sub$Bait)) #assign the box shape to the bait nodes
names(shape) <- SuppS6_sub$Bait
baitFont <- rep(48, length(SuppS6_sub$Bait))#assign the font size to the bait nodes
names(baitFont) <- SuppS6_sub$Bait
lwd <- rep(2, length(SuppS6_sub$Bait)) #make the borders around the bait nodes squares 2px thick.
names(lwd) <- SuppS6_sub$Bait
## the nodeRenderInfo takes these list with parameters (settings) and uses that format the graph.
nodeRenderInfo(ppiGraph) <- list(fontsize=preyFont)
nodeRenderInfo(ppiGraph) <- list(fontsize=baitFont)
nodeRenderInfo(ppiGraph) <- list(fill=fill)
nodeRenderInfo(ppiGraph) <- list(shape=shape)
nodeRenderInfo(ppiGraph) <- list(lwd=lwd)
nodeRenderInfo(ppiGraph) <- list(lty=1)
## some additional markup stuff, set at title and subtitle.
graph.par(list(nodes=list(fontsize=36), graph=list(main="PPI NETWORK", sub="DIS > 0.5", cex.main=1.8, cex.sub=1.4, col.sub="gray")))
## finally we use the renderGraph function to draw the graph again
renderGraph(ppiGraph)
Maybe this plot is not 1:1 the same look and feel as the article one, but it is much more informative like this.
We can get a lot of information from the plots - What connects to individual nodes? - How many connections does a node have? (Are they key players?) - What is the distribution of the amount of connections? (tells you something about )
We’ve loaded the ‘graph’ package and are therefore using GraphNEL graphs. The number of connections per node can be found with the ‘degree’ function. Sort the output of this function to give a nice overview of what proteins have the most connected proteins.
In R, there is a function called ‘table’, that will give you the possibility to count frequencies of something. Here, we can do that with the degrees, so we can summarize how many 1s, 2s and other numbers are there. Use this function to analyze the output of the degree function from above
hist(sort(degree(ppiGraph)))
How many proteins with a degree of ‘0’ do you find? How many proteins with a degree of 1? How many with a degree greater than 40? Explain why you see this. Think about it, also in terms of the experiment we’re analyzing
Type your answer here!
Extract the connected components from our graph using the connectedComp method. What do you think connected components are? Assign the outcome to a variable with a name you will still understand when we wake you up at night (hypothetically of course!).
Read about the ‘adj’ function, and extract all interactors of the TP53 protein.
#START_ANSWER
adj(ppiGraph, 'TP53')
## $TP53
## [1] "TLN1" "FARSB" "SARS" "PPP1R10" "CLUH" "ATIC" "HGS"
## [8] "SEC24C" "L1RE1" "LENG1" "TOX4" "GSTO1" "AHCY" "JPT2"
## [15] "OLA1" "PIP4K2C" "DRG1" "IDH1" "44079" "WDR1" "GSPT1"
## [22] "GAK" "SEC31A" "EVL" "STRAP" "MTMR14" "SNX5" "ZC3H15"
## [29] "ANK3" "NELFA" "FBXW8" "GDF15" "CSNK2A2" "USP7" "RPL5"
## [36] "EPPK1" "DDX60" "CDKN1A" "LRP1" "DNAJB1" "SUPT5H" "DNAJA3"
#END_ANSWER